The purpose of this notebook is to get an overview of the data in the dataset data/immo_data202208_v2.parquet.
The dataset contains 22481 rows (9126 more than the V1 dataset) and 134 columns (26 more than V1).
The dataset covers the features listed below. In the table, (=) marks a column unchanged from V1, (+N) a column that gained N entries, and (new) a column that did not exist in V1:
| Feature | Columns |
| ----------------- | ------- |
| Availability | Availability (=), Availability_merged (=), Disponibilità (=), Disponibilité (=), Verfügbarkeit (=), detail_responsive#available_from (=) |
| Address | Commune (=), Comune (=), Gemeinde (=), Municipality (+9126), Municipality_merged (=), detail_responsive#municipality (=), address (=), address_s (new), Locality, location (+9126), location_parsed (+9126), Zip (+9126), plz (new), plz_parsed (new) |
| Coordinates | Latitude (+9126), lat (+9126), Longitude (+9126), lon (+9126) |
| Floor | Floor (+4636), Floor_merged (=), Floor_unified (new), Piano (=), Stockwerk (=), Étage (=), detail_responsive#floor (=) |
| Floor space | Floor space (=), Floor space: (new), Floor_space_merged (=), Minimum floor space (new), Nutzfläche (=), Superficie utile (=), Surface utile (=), detail_responsive#surface_usable (=) |
| Gross return | Gross return (=), Gross yield (new) |
| Plot area | Grundstücksfläche (=), Land area: (new), Plot area (=), Plot_area_merged (=), Plot_area_unified (new), Superficie del terreno (=), Surface de terrain (=), detail_responsive#surface_property (=) |
| Living space | Living space (=), Living_area_unified (new), Living_space_merged (=), Superficie abitabile (=), Surface habitable (=), Surface living: (new), Wohnfläche (=), detail_responsive#surface_living (=) |
| Environment | NoisePollutionRailwayL (+9126), NoisePollutionRailwayM (+9126), NoisePollutionRailwayS (+9126), NoisePollutionRoadL (+9126), NoisePollutionRoadM (+9126), NoisePollutionRoadS (+9126), PopulationDensityL (+9126), PopulationDensityM (+9126), PopulationDensityS (+9126), RiversAndLakesL (+9126), RiversAndLakesM (+9126), RiversAndLakesS (+9126), ForestDensityL (+9126), ForestDensityM (+9126), ForestDensityS (+9126), WorkplaceDensityL (+9126), WorkplaceDensityM (+9126), WorkplaceDensityS (+9126), distanceToTrainStation (+9126) |
| gde | gde_area_agriculture_percentage (+9126), gde_area_forest_percentage (+9126), gde_area_nonproductive_percentage (+9126), gde_area_settlement_percentage (+9126), gde_average_house_hold (+9126), gde_empty_apartments (+9126), gde_foreigners_percentage (+9126), gde_new_homes_per_1000 (+9126), gde_politics_bdp (+5256), gde_politics_cvp (+9108), gde_politics_evp (+8653), gde_politics_fdp (+9121), gde_politics_glp (+5413), gde_politics_gps (+9098), gde_politics_pda (+4578), gde_politics_rights (+5117), gde_politics_sp (+9123), gde_politics_svp (+9124), gde_pop_per_km2 (+9126), gde_population (+9126), gde_private_apartments (+9126), gde_social_help_quota (+9126), gde_tax (+9126), gde_workers_sector1 (+9126), gde_workers_sector2 (+9126), gde_workers_sector3 (+9126), gde_workers_total (+9126) |
| Price | price (+9126), price_cleaned (+9126), price_s (new) |
| Rooms | No. of rooms: (new), rooms (+8868) |
| Type | type (+9126), type_unified (new) |
| Homegate features | features (new), Volume: (new), Room height: (new), Number of toilets: (new), Number of floors: (new), Number of apartments: (new), Last refurbishment: (new), Year built: (new) |
Many features are spread across multiple columns, often one per language or source. Two companion notebooks explore how these columns can be aggregated.
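To illustrate what aggregating such columns could look like, here is a minimal sketch that coalesces several language variants of one feature into a single column by taking the first non-null value per row. The toy values, the helper `coalesce_columns`, and the output column `living_space_coalesced` are hypothetical and only serve the example; this is not necessarily how the `*_merged` or `*_unified` columns in the dataset were built.

```python
import pandas as pd

# Toy data using four of the "Living space" column names from the table above.
# The values are made up for illustration.
toy = pd.DataFrame(
    {
        "Living space": ["120 m²", None, None, None],
        "Wohnfläche": [None, "85 m²", None, None],
        "Surface habitable": [None, None, "60 m²", None],
        "Superficie abitabile": [None, None, None, None],
    }
)


def coalesce_columns(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Return, for each row, the first non-null value found across `columns`."""
    # bfill along the column axis pulls later values leftwards into gaps,
    # so the first column then holds the first non-null value of each row.
    return frame[columns].bfill(axis=1).iloc[:, 0]


source_columns = list(toy.columns)
toy["living_space_coalesced"] = coalesce_columns(toy, source_columns)
print(toy["living_space_coalesced"])
```

Rows where every source column is empty stay empty in the result, which keeps missing values visible for later inspection.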
# Import modules
import pandas as pd
import numpy as np
import sweetviz as sv
df = pd.read_parquet(
"https://github.com/Immobilienrechner-Challenge/data/blob/main/immo_data_202208_v2.parquet?raw=true"
)
df.shape
(22481, 134)
df_v1 = pd.read_csv(
"https://raw.githubusercontent.com/Immobilienrechner-Challenge/data/main/immoscout_cleaned_lat_lon_fixed_v9.csv",
low_memory=False,
)
df_v1 = df_v1.drop_duplicates(subset="link")
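With both versions loaded, per-column annotations like the ones in the feature table could be derived by comparing non-null counts. The sketch below assumes that (+N) denotes a change in the number of non-null values, which the notebook does not state explicitly. The helper `column_changes` and the toy frames are hypothetical; in the notebook, the call would be `column_changes(df, df_v1)`.

```python
import pandas as pd


def column_changes(new: pd.DataFrame, old: pd.DataFrame) -> dict:
    """Label each column of `new` as 'new', '=', or a signed non-null count difference."""
    changes = {}
    for col in new.columns:
        if col not in old.columns:
            # Column does not exist in the older version at all.
            changes[col] = "new"
            continue
        diff = int(new[col].notna().sum() - old[col].notna().sum())
        changes[col] = "=" if diff == 0 else f"{diff:+d}"
    return changes


# Toy stand-ins for the V1 and V2 frames (values are made up).
old_toy = pd.DataFrame({"price": [1.0, 2.0], "rooms": [3.0, None]})
new_toy = pd.DataFrame(
    {
        "price": [1.0, 2.0, 3.0],
        "rooms": [3.0, None, None],
        "plz": [8000, 8001, 8002],
    }
)

toy_changes = column_changes(new_toy, old_toy)
print(toy_changes)
```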
# Compare the V2 data against the V1 data and show the sweetviz report
sweet_report = sv.compare([df, "V2 data"], [df_v1, "V1 data"])
sweet_report.show_notebook()